We will be attempting to find a linear regression that models college tuition rates, based on a dataset from US News and World Report. Alas, this data is from 1995, so it is very outdated; still, we will see what we can learn from it.
http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv; figure out how to use the code you were given last time for read.csv( ) and read.table( ) to read the data into R and call it tuition. Use the functions we learned last time to familiarize yourself with the data in tuition. #Your Code Here
tuition = read.csv("http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv")
head(tuition)
## ID Name State Public Avg.SAT Avg.ACT
## 1 1061 Alaska Pacific University AK 2 972 20
## 2 1063 University of Alaska at Fairbanks AK 1 961 22
## 3 1065 University of Alaska Southeast AK 1 NA NA
## 4 11462 University of Alaska at Anchorage AK 1 881 20
## 5 1002 Alabama Agri. & Mech. Univ. AL 1 NA 17
## 6 1003 Faulkner University AL 2 NA 20
## Applied Accepted Size Out.Tuition Spending
## 1 193 146 249 7560 10922
## 2 1852 1427 3885 5226 11935
## 3 146 117 492 5226 9584
## 4 2065 1598 6209 5226 8046
## 5 2817 1920 3958 3400 7043
## 6 345 320 1367 5600 3971
summary(tuition)
## ID Name State Public
## Min. : 1002 Bethel College : 4 NY :101 Min. :1.000
## 1st Qu.: 1874 Concordia College: 4 PA : 83 1st Qu.:1.000
## Median : 2650 Trinity College : 4 CA : 70 Median :2.000
## Mean : 3126 Columbia College : 3 TX : 60 Mean :1.639
## 3rd Qu.: 3431 Union College : 3 MA : 56 3rd Qu.:2.000
## Max. :30431 Augustana College: 2 OH : 52 Max. :2.000
## (Other) :1282 (Other):880
## Avg.SAT Avg.ACT Applied Accepted
## Min. : 600.0 Min. :11.00 Min. : 35.0 Min. : 35.0
## 1st Qu.: 884.5 1st Qu.:20.25 1st Qu.: 695.8 1st Qu.: 554.5
## Median : 957.0 Median :22.00 Median : 1470.0 Median : 1095.0
## Mean : 968.0 Mean :22.12 Mean : 2752.1 Mean : 1870.7
## 3rd Qu.:1038.0 3rd Qu.:24.00 3rd Qu.: 3314.2 3rd Qu.: 2303.0
## Max. :1410.0 Max. :31.00 Max. :48094.0 Max. :26330.0
## NA's :523 NA's :588 NA's :10 NA's :11
## Size Out.Tuition Spending
## Min. : 59 Min. : 1044 Min. : 1834
## 1st Qu.: 966 1st Qu.: 6111 1st Qu.: 6116
## Median : 1812 Median : 8670 Median : 7729
## Mean : 3693 Mean : 9277 Mean : 8988
## 3rd Qu.: 4540 3rd Qu.:11659 3rd Qu.:10054
## Max. :31643 Max. :25750 Max. :62469
## NA's :3 NA's :20 NA's :39
tuition called Acc.Rate that contains the acceptance rate for each university. You may find it handy to use the variables ‘Applied’ and ‘Accepted’. # YOUR CODE HERE
tuition$Acc.Rate = tuition$Accepted / tuition$Applied
head(tuition)
## ID Name State Public Avg.SAT Avg.ACT
## 1 1061 Alaska Pacific University AK 2 972 20
## 2 1063 University of Alaska at Fairbanks AK 1 961 22
## 3 1065 University of Alaska Southeast AK 1 NA NA
## 4 11462 University of Alaska at Anchorage AK 1 881 20
## 5 1002 Alabama Agri. & Mech. Univ. AL 1 NA 17
## 6 1003 Faulkner University AL 2 NA 20
## Applied Accepted Size Out.Tuition Spending Acc.Rate
## 1 193 146 249 7560 10922 0.7564767
## 2 1852 1427 3885 5226 11935 0.7705184
## 3 146 117 492 5226 9584 0.8013699
## 4 2065 1598 6209 5226 8046 0.7738499
## 5 2817 1920 3958 3400 7043 0.6815761
## 6 345 320 1367 5600 3971 0.9275362
summary(tuition)
## ID Name State Public
## Min. : 1002 Bethel College : 4 NY :101 Min. :1.000
## 1st Qu.: 1874 Concordia College: 4 PA : 83 1st Qu.:1.000
## Median : 2650 Trinity College : 4 CA : 70 Median :2.000
## Mean : 3126 Columbia College : 3 TX : 60 Mean :1.639
## 3rd Qu.: 3431 Union College : 3 MA : 56 3rd Qu.:2.000
## Max. :30431 Augustana College: 2 OH : 52 Max. :2.000
## (Other) :1282 (Other):880
## Avg.SAT Avg.ACT Applied Accepted
## Min. : 600.0 Min. :11.00 Min. : 35.0 Min. : 35.0
## 1st Qu.: 884.5 1st Qu.:20.25 1st Qu.: 695.8 1st Qu.: 554.5
## Median : 957.0 Median :22.00 Median : 1470.0 Median : 1095.0
## Mean : 968.0 Mean :22.12 Mean : 2752.1 Mean : 1870.7
## 3rd Qu.:1038.0 3rd Qu.:24.00 3rd Qu.: 3314.2 3rd Qu.: 2303.0
## Max. :1410.0 Max. :31.00 Max. :48094.0 Max. :26330.0
## NA's :523 NA's :588 NA's :10 NA's :11
## Size Out.Tuition Spending Acc.Rate
## Min. : 59 Min. : 1044 Min. : 1834 Min. :0.09139
## 1st Qu.: 966 1st Qu.: 6111 1st Qu.: 6116 1st Qu.:0.68122
## Median : 1812 Median : 8670 Median : 7729 Median :0.78261
## Mean : 3693 Mean : 9277 Mean : 8988 Mean :0.75479
## 3rd Qu.: 4540 3rd Qu.:11659 3rd Qu.:10054 3rd Qu.:0.86087
## Max. :31643 Max. :25750 Max. :62469 Max. :1.00000
## NA's :3 NA's :20 NA's :39 NA's :13
# YOUR CODE HERE
tuition[tuition$Name == "University of North Carolina at Chapel Hill",]
## ID Name State Public Avg.SAT
## 682 2974 University of North Carolina at Chapel Hill NC 1 1121
## Avg.ACT Applied Accepted Size Out.Tuition Spending Acc.Rate
## 682 NA 14596 5985 14609 8400 15893 0.4100438
We have seen many examples of using functions in R, like summary( ) or t.test( ). Now you will learn how to write your own functions. Defining a function means writing code that looks something like this:
my_function <- function(VAR_1, VAR_2){
# do some stuff with VAR_1 and VAR_2
return(result)
}
Then you run the code in R to “teach” it how your function works, and after that, you can use it like you would any other pre-existing function. For example, try out the following:
add1 <- function(a, b){
# add the variables
c = a + b
return(c)
}
add2 <- function(a, b = 3){
# add the variables
c = a + b
return(c)
}
# Try adding 5 and 7
add1(5, 7)
## [1] 12
add2(5, 7)
## [1] 12
# Try adding one variable
#add1(5)
add2(5)
## [1] 8
What was the effect of b = 3 in the definition of add2( )?
The default value for b is 3
Write your own functions, called beta1( ) and beta0( ) that take as input some combination of Sx, Sy, r, y_bar, and x_bar, and use that to calculate \(\beta_1\) and \(\beta_0\).
# YOUR CODE HERE
beta1 = function(r, sY, sX) {
if (identical(sX, 0)) {
print("sX cannot be 0")
return(NA)
}
return(r * sY / sX)
}
beta0 = function(r, sY, sX, y_bar, x_bar) {
return(ybar - beta1(r, sY, sX) * x_bar)
}
beta1(1, 1, 0)
## [1] "sX cannot be 0"
## [1] NA
Sx = 0. Did it work? If not, fix your function code. Explain why it would be a problem to do linear regression with \(S_X = 0\).Cannot divide by zero
Use the code below to make a scatterplot of college tuition versus average SAT score.
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "title", xlab = "label", ylab = "label", pch = 7, cex = 2, col = "blue")
plot( ) so that it looks nice. # YOUR CODE HERE
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
pch and cex do?pch specifies the symbols used
cex indicates the amount by which plotting text and symbols are scaled relative to the default size
abline( ) to add a vertical line or a horizontal line to a graph. However, it can also add lines by slope and intercept. Read the documentation of abline( ) until you understand how to do this. Then add a line with slope 10 and intercept 0 to your plot. # YOUR CODE HERE
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
abline(a=0, b=10)
Not particularly well
beta1( ) and beta0( ), to find the slope and intercept for a regression line of Avg.SAT on Out.Tuition. Remake your scatterplot, and add the regression line.(Hint: You may have some trouble finding the mean and sd because there is some missing data. Look at the documentation for the functions you use. What could we add to the function arguments to ignore values of NA?)
# YOUR CODE HERE
tuition = tuition[!is.na(tuition$Out.Tuition) & !is.na(tuition$Avg.SAT),]
ybar = mean(tuition$Out.Tuition)
xbar = mean(tuition$Avg.SAT)
sY = sd(tuition$Out.Tuition)
sX = sd(tuition$Avg.SAT)
r = cor(tuition$Avg.SAT, tuition$Out.Tuition)
b0 = beta0(r, sY, sX, ybar, xbar)
b1 = beta1(r, sY, sX)
paste(ybar, xbar, sY, sX, r, b1, b0)
## [1] "9974.48109517601 967.196870925684 4001.20564002006 122.616914275463 0.606700481134289 19.7977041035616 -9173.79636530133"
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
abline(a=b0, b=b1)
I can conclude that there is a correlation between average SAT score and out of state tuition
predict_yval(X, Y, x_new) that takes as input a vector of explanatory variables (X), a vector of y-variables (Y), and a new x-value that we want to predict (x_new). The output of the function should be the predicted y-value for x_new from a regression line. (Hint: You can use functions inside functions.)predict_yval <- function(X, Y, x_new) {
X = na.omit(X)
Y = na.omit(Y)
sY = sd(Y)
sX = sd(X)
r = cor(X, Y)
return(beta1(r, sY, sX)*x_new + beta0(r, sY, sX, mean(Y), mean(X)))
}
# YOUR CODE HERE
duke_sat_avg = tuition[tuition$Name == "Duke University",]$Avg.SAT
unc_sat_avg = tuition[tuition$Name == "University of North Carolina at Chapel Hill",]$Avg.SAT
duke_tuition_avg = tuition[tuition$Name == "Duke University",]$Out.Tuition
unc_tuition_avg = tuition[tuition$Name == "University of North Carolina at Chapel Hill",]$Out.Tuition
duke_tuition_prediction = predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, duke_sat_avg)
unc_tuition_prediction = predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, unc_sat_avg)
duke_sat_prediction = predict_yval(tuition$Out.Tuition, tuition$Avg.SAT, duke_tuition_avg)
unc_sat_prediction = predict_yval(tuition$Out.Tuition, tuition$Avg.SAT, unc_tuition_avg)
cat(duke_sat_avg, unc_sat_avg, duke_tuition_avg, unc_tuition_avg)
## 1302 1121 18590 8400
cat(duke_sat_prediction, unc_sat_prediction, duke_tuition_prediction, unc_tuition_prediction)
## 10134.66 9945.208 16602.81 13019.43
Not getting a good deal at Duke, getting a great deal at UNC
lm() and diagnosticsYou now have functions to calculate the slope and intercept of a linear regression, and to predict values. As you might expect, R was already able to do this, using the function lm( ). In class, you saw how to read the output of lm( ). Run the following regression of Avg.SAT on Out.Tuition, and refamiliarize yourself with the output.
# Make linear model
my_lm = lm(Out.Tuition ~ Avg.SAT, data = tuition)
summary(my_lm)
##
## Call:
## lm(formula = Out.Tuition ~ Avg.SAT, data = tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9809.9 -2234.6 204.6 2309.3 12336.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9173.7964 914.3482 -10.03 <2e-16 ***
## Avg.SAT 19.7977 0.9379 21.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3183 on 765 degrees of freedom
## Multiple R-squared: 0.3681, Adjusted R-squared: 0.3673
## F-statistic: 445.6 on 1 and 765 DF, p-value: < 2.2e-16
names(my_lm)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
abline(my_lm)
plot(my_lm)
Check out names(my_lm). This will give you a list of information we can access using $. For example, compare my_lm$coefficents to your beta1 and beta0 outputs.
The output of lm( ) is made to play nicely with other functions in R. For example, try adding abline(my_lm) to your scatterplot. We can also use lm( ) to check some common diagnostics, to see if the linear model is a good fit for the data. Try plot(my_lm), and familiarize yourself with the first three plots that are automatically generated. (The fourth is not covered in this course, so you do not need to worry about it for now.)
Spending contains the expenditure of the school per student. Suppose we want to make a regression that predicts tuition cost from the expenditure per student. Make a linear model for Spending versus Out.Tuition. Comment on the summary of the model, and plot it on an appropriate scatter plot. Does the model seem to be a good fit for this data?my_lm = lm(Out.Tuition ~ Spending, data = tuition)
plot(tuition$Spending, tuition$Out.Tuition, main = "Tuition vs Spending", ylab = "Tuition", xlab = "Spending", pch = 1, cex = 1, col = "blue")
abline(my_lm)
The model is not a very good fit. There are extreme outliers that are skewing the result.
Spending. Do you notice any issues? Hint: Use your own function predict_yval() or the built-in function predict(my_lm). You will need to think about the problem of missing data (NAs).plot(na.omit(tuition$Spending), na.omit(my_lm$residuals), main = "Residuals vs Spending", ylab = "Residuals", xlab = "Spending", col = "blue")
plot() to look at the automatic diagnostics. What is each plot showing? What seems to be going wrong here? Which schools are marked as outliers?plot(my_lm)
The high-spending outlier schools are heavily skewing the model
tuition_spending = tuition[tuition$Spending < 20000,]
my_lm = lm(Out.Tuition ~ Spending, data = tuition_spending)
plot(tuition_spending$Spending, tuition_spending$Out.Tuition, main = "Tuition vs Spending", ylab = "Tuition", xlab = "Spending", pch = 1, cex = 1, col = "blue")
abline(my_lm)
Size versus Out.Tuition. Comment on the summary of the model, and plot it on an appropriate scatter plot, and use plot( ) to look at the diagnostics. Does the model seem to be a good fit for this data?my_lm = lm(Out.Tuition ~ Size, data=tuition)
plot(tuition_spending$Size, tuition_spending$Out.Tuition, main = "Tuition vs Size", ylab = "Tuition", xlab = "Size", col = "blue")
abline(my_lm)
plot(my_lm)
length(tuition$Spending)
## [1] 767
There does not appear to be a significant correlation between size and tuition
col = tuition$Public. What did this change? Can you use this information to explain why the regression line in (a) did not fit well? # YOUR CODE HERE
Size versus Out.Tuition for private schools and for public schools. Plot both of these, appropriately colored, on your scatterplot. Comment on the output and diagnostics. # YOUR CODE HERE
We have seen that a college’s tuition relates to its size, its spending per student, and its average SAT score. We have also seen that this relationship may change based on whether the school is public or private. Ideally, instead of making separate regressions for each relationship, we could combine them all into a multiple regression. Fortunately, R makes this easy with lm().
mult.1 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + Acc.Rate, data = tuition)
summary(mult.1)
Public to the regression. The following two models are slightly different, but give essentially identical output. What is the difference between them, and why is it important even though the output still the same? mult.2 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + Public, data = tuition)
mult.3 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + factor(Public), data = tuition)
plot( ) to look at diagnostics for mult.3, and comment. # YOUR CODE HERE
pairs( ), and then use it on the variables involved in mult.3. Hint: You can use the option col = tuition$Public in pairs( ) # YOUR CODE HERE
# YOUR CODE HERE
We saw in 12c that whether a school is public or private can affect not only the tuition, but also how the tuition relates to other variables. In a multiple regression, this effect can be captured through interaction terms, which are expressed by var1:var2, and measure how much one variable changes the effect of the other.
Read the following paragraph from the documentation ?formula for some shortcuts for including interactions:
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
Create your own multiple regression that predicts tuition from whichever variables you choose, as well as some interaction terms between Public and other variables. Don’t worry about using any official methods to pick variables; simply try a few things and choose the model that seems best. Interpret the results; in particular, think very carefully about what the coefficient for an interaction term with Public might mean.
# YOUR CODE HERE